COVID 19 Case Rates and Observed Mask Usage in NYC

Toggle between layers to view COVID 19 case rates from NYC Health and observed mask usage reported by the New York Times. Data is divided by ZIP Code Tabulation Area (ZCTA).

Click on an area for more info.

NYC Health (Dates)

Using data from NYC Health, these layers map city-wide COVID 19 case rates per 100,000 people in each ZCTA. The map includes weekly rates from July 13 to August 10, 2020. Hovering over a ZCTA shows the area name, borough, ZCTA, and case rate as designated by NYC Health.

New York Times Mask Observations (NYT Obs)

Using data from the New York Times article “Are New Yorkers Wearing Masks?”, this layer maps the mask usage rates observed by the Times’ reporters between July 27 and July 30, 2020. The additional NYT Obs layers show observed mask usage rates by perceived gender. The ZCTAs containing the intersections where the Times’ reporters made their observations were used to map the observed mask usage rates and to compare them with the NYC Health data. Hovering over a ZCTA shows the area name, borough, and intersection of observation as reported by the Times. ZCTAs were found by looking up the locations of the intersections.

Link to Article


Overview

Methodology

This report compares the observed mask usage rates in select areas of NYC with the COVID 19 data collected around the same time. The data is shown with visuals such as maps and charts for comparison across areas of NYC. Each area in this analysis is a ZCTA, the unit used by NYC Health, which can be inferred from the intersections named in the NY Times article. The dates were chosen to allow for the estimated two-week period it can take for COVID 19 symptoms to show.

Linear regression analysis is conducted later to determine whether observed mask rates and population are significant predictors of case rates.

Motive

I wanted to see if the observed mask usage rates from the Times would correlate with the spread of COVID 19 in NYC, and I also wanted an opportunity to use data visualisations to compare related data sets.

Sources

NYC Health

NYC Health COVID-19: Data

NYC Health COVID-19 Data GitHub Repo

NYC Health publishes its official COVID 19 data on its website and GitHub. I used the data-by-modzcta.csv data sets from the commits on the following dates: 07-13, 07-20, 07-27, 08-03, 08-10. Within these data sets, I used the COVID_CASE_RATE variable to measure the spread of COVID 19. Here’s the definition from NYC Health:

COVID_CASE_RATE - Rate of confirmed cases per 100,000 people by modified ZCTA

I chose COVID_CASE_RATE as it is an easy-to-understand metric that adjusts for the population within each ZCTA.

A major limitation of this data set, as with most data relating to COVID 19, is that it only includes detected and confirmed cases of the disease. Undetected cases that contribute to the spread may exist and would not be included in this data set.
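As a rough illustration of what a per-100,000 rate adjusts for, here is a minimal sketch that recomputes a case rate from a case count and a population. The column names COVID_CASE_COUNT and POP_DENOMINATOR come from the NYC Health file, but the `zcta` identifiers and all numbers below are made up for the example.

```r
# Hypothetical rows shaped like data-by-modzcta.csv; ids and numbers are made up.
zcta <- data.frame(
  zcta             = c("00001", "00002", "00003"),
  COVID_CASE_COUNT = c(250, 1200, 90),
  POP_DENOMINATOR  = c(25000, 40000, 30000),
  stringsAsFactors = FALSE
)

# Cases per 100,000 residents: count / population * 100,000.
zcta$case_rate <- zcta$COVID_CASE_COUNT / zcta$POP_DENOMINATOR * 1e5

print(zcta)
```

The second area has the most cases and also the highest rate, while the third has a larger population than the first but a much lower rate, which is the kind of difference a raw count would hide.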

The New York Times - “Are New Yorkers Wearing Masks?”

The New York Times - Are New Yorkers Wearing Masks?

The article from the New York Times shares their observed mask usage in 14 locations throughout New York City. These observations were taken between July 27 and July 30, in the daytime between 09:00 and 19:00. According to the Times, 340 to 567 people were observed at each location, and only pedestrians were counted (people in cars and on skateboards or bikes were excluded). The data is presented with the intersection where the observations were made, an overall percentage of mask usage, and percentages of mask usage for men and women. Gender was determined by a person’s ‘apparent gender’. I manually compiled the Times’ observations across the city and used the ZCTAs in which the mentioned intersections are located.

A major limitation of this data set is that it only provides one observation per location, so trends over time cannot be inferred. Observations were made only at specific intersections and are not necessarily representative of the entire ZCTA. The Times stated that their selected observation spots were chosen due to their expected population density.
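Joining the hand-compiled Times observations to the NYC Health data by ZCTA might look like the sketch below. This is only an illustration: the data frame names, the `zcta` and `intersection` columns, and every value are hypothetical, and `obs_mask` is borrowed from the model formulas later in this report.

```r
# Hypothetical hand-compiled Times observations (made-up values).
times_obs <- data.frame(
  zcta         = c("00001", "00002", "00003"),
  intersection = c("Example St & 1st Ave", "Sample Blvd & 2nd St", "Demo Ave & 3rd St"),
  obs_mask     = c(0.90, 0.75, 0.60),
  stringsAsFactors = FALSE
)

# Hypothetical NYC Health rows; this table covers more ZCTAs than the Times visited.
health <- data.frame(
  zcta            = c("00001", "00002", "00003", "00004"),
  COVID_CASE_RATE = c(1000, 3000, 300, 2000),
  stringsAsFactors = FALSE
)

# merge() does an inner join by default, so only observed ZCTAs are kept.
merged <- merge(times_obs, health, by = "zcta")
print(merged)
```

Keeping the ZCTA as a character column avoids accidental numeric coercion when the two sources are joined.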

‘Are New Yorkers Wearing Masks’ Data

Visual

This chart depicts the observed mask usage rates from the Times. The data is a percentage of the observed population in these areas that were seen using masks. Some notable highlights include the very high mask usage in Flushing, an area with strong ties to mainland China, and in areas like Corona that were heavily impacted by the spread of COVID 19. It should be noted that Rockaway Beach is an outlier as it is a beach, while the other locations were busy intersections.

NYC Health - Case Rate per 100,000 people

Visual

Background

This chart visualises COVID 19 case rates per 100,000 people in the same ZCTAs that the Times reported on. As mentioned earlier, the data covers the 2 weeks before and after the Times’ reporting on July 27-30 and is sampled on the following dates: July 13, July 20, July 27, August 3, and August 10. The ZCTA with the highest case rate within this set is Corona (11368), which borders the ZCTA with the highest case rate in NYC (East Elmhurst, 11369). Another observation is that the two Manhattan locations reported on (Harlem, East Village) have lower case rates than the rest of the set.

Analysis

Merging the Data - Average Change in Case Rates

In order to capture the impact of masks throughout the 5 weeks, I calculated the average rate of change in case rates and used this to compare with the observed mask rates. I created a linear regression model using this average rate and the observed mask rates to determine if there is a statistically significant relationship between these two variables.
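One way to compute an average rate of change is sketched below. It assumes the change is measured as the week-over-week relative change in COVID_CASE_RATE, averaged over the five snapshots; the exact definition used in this report may differ, and the `weekly` data frame, its `zcta` ids, and its values are made up.

```r
# Hypothetical weekly case rates per 100,000 for two ZCTAs (made-up numbers).
snapshot_dates <- as.Date(c("2020-07-13", "2020-07-20", "2020-07-27",
                            "2020-08-03", "2020-08-10"))
weekly <- data.frame(
  zcta            = rep(c("00001", "00002"), each = 5),
  date            = rep(snapshot_dates, times = 2),
  COVID_CASE_RATE = c(1000, 1010, 1020, 1030, 1040,
                      2000, 2020, 2040, 2060, 2080),
  stringsAsFactors = FALSE
)

# Assumed definition: mean of (this week - last week) / last week.
avg_change <- function(rate) {
  mean(diff(rate) / head(rate, -1))
}

# Sort so each ZCTA's rates are in date order, then average per ZCTA.
weekly <- weekly[order(weekly$zcta, weekly$date), ]
avg_rate <- vapply(split(weekly$COVID_CASE_RATE, weekly$zcta),
                   avg_change, numeric(1))
print(avg_rate)
```

Because the change is relative, both toy ZCTAs end up with the same average even though one has twice the case rate of the other.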


Since the Rockaway Beach (ZIP: 11694) observed mask rates were taken in conditions that are significantly different from the rest of the observations, I’ve excluded them from the regression models.

## 
## Call:
## lm(formula = avg_rate ~ obs_mask, data = nyc.avg.rate.mask)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0034983 -0.0009330 -0.0000025  0.0005589  0.0035218 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.003793   0.004452   0.852    0.412
## obs_mask    0.008097   0.005568   1.454    0.174
## 
## Residual standard error: 0.002174 on 11 degrees of freedom
## Multiple R-squared:  0.1613, Adjusted R-squared:  0.08503 
## F-statistic: 2.115 on 1 and 11 DF,  p-value: 0.1738

Looking at the model, we are unable to reject the null hypothesis, since p = .174 is greater than .05. Based on this, I cannot say that observed mask usage has a relationship with the rate of change in COVID 19 cases.
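The exclusion and the decision rule can be sketched as follows. The data frame name and formula match the model call above, but the data itself is randomly generated, the `zcta` column is an assumption, and only the Rockaway Beach code 11694 is taken from this report.

```r
set.seed(1)

# Randomly generated stand-in for the 14 observed locations (not real data).
nyc.avg.rate.mask <- data.frame(
  zcta     = c(sprintf("%05d", 1:13), "11694"),
  obs_mask = runif(14, min = 0.5, max = 0.95),
  avg_rate = rnorm(14, mean = 0.01, sd = 0.002),
  stringsAsFactors = FALSE
)

# Drop Rockaway Beach before fitting, leaving 13 locations.
model_data <- subset(nyc.avg.rate.mask, zcta != "11694")

fit <- lm(avg_rate ~ obs_mask, data = model_data)

# Pull the slope's p-value out of the coefficient table and compare to 0.05.
coefs       <- summary(fit)$coefficients
p_value     <- coefs["obs_mask", "Pr(>|t|)"]
reject_null <- p_value < 0.05

cat("p-value:", round(p_value, 3), "reject null:", reject_null, "\n")
```

With 13 locations and two estimated coefficients, the fit has 11 residual degrees of freedom, which matches the summary output above.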

When looking at the diagnostic plots, we can find that there is not a clear linear relationship in this model, and that there is a heteroscedasticity problem with the data.
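For readers who want to reproduce this kind of check, here is a minimal sketch on randomly generated data. The residuals-vs-fitted and scale-location plots come from base R's `plot()` method for `lm` objects. The numeric check is an addition that is not part of this report: a studentized Breusch-Pagan statistic computed by hand as n times the R-squared of an auxiliary regression of squared residuals on the predictor.

```r
set.seed(2)

# Randomly generated stand-in data (not the report's data).
d <- data.frame(obs_mask = runif(13, min = 0.5, max = 0.95))
d$avg_rate <- 0.004 + 0.008 * d$obs_mask + rnorm(13, sd = 0.002)

fit <- lm(avg_rate ~ obs_mask, data = d)

# Draw the two diagnostic plots to a null device so the sketch runs headless.
pdf(NULL)
plot(fit, which = 1)  # residuals vs fitted: look for curvature
plot(fit, which = 3)  # scale-location: look for a trend in spread
invisible(dev.off())

# Hand-rolled studentized Breusch-Pagan check (an assumption, see lead-in).
d$resid_sq <- resid(fit)^2
aux     <- lm(resid_sq ~ obs_mask, data = d)
bp_stat <- nrow(d) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

cat("BP statistic:", round(bp_stat, 3), "p-value:", round(bp_p, 3), "\n")
```

A small p-value here would point toward non-constant residual variance, though with only 13 points both the plots and the test have limited power.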

Merging the Data - Case Rates

Instead of looking only at average changes in case rates, I also decided to look at the case rates directly. I took data from the beginning, middle, and end of my time frame and created a linear model for each date with observed mask rates. As in the previous section, I removed the Rockaway Beach data.
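Fitting the same model to each snapshot and collecting the results side by side could be sketched like this. The data frame names mirror the model calls in this section, but their contents are randomly generated and `summarise_fit` is a hypothetical helper.

```r
set.seed(3)

# Mask rates are observed once, so the same values are reused for every date.
obs_mask <- runif(13, min = 0.5, max = 0.95)

# Randomly generated stand-ins for the three snapshots (not real data).
weeks <- list(
  mask0713 = data.frame(obs_mask = obs_mask,
                        COVID_CASE_RATE = rnorm(13, mean = 2700, sd = 1000)),
  mask0727 = data.frame(obs_mask = obs_mask,
                        COVID_CASE_RATE = rnorm(13, mean = 2720, sd = 1000)),
  mask0810 = data.frame(obs_mask = obs_mask,
                        COVID_CASE_RATE = rnorm(13, mean = 2750, sd = 1000))
)

# Hypothetical helper: fit one model and return slope, p-value, and R-squared.
summarise_fit <- function(d) {
  fit <- lm(COVID_CASE_RATE ~ obs_mask, data = d)
  s   <- summary(fit)
  c(slope     = unname(coef(fit)["obs_mask"]),
    p_value   = s$coefficients["obs_mask", "Pr(>|t|)"],
    r_squared = s$r.squared)
}

# One row per snapshot, one column per statistic.
results <- t(sapply(weeks, summarise_fit))
print(round(results, 4))
```

Putting the three fits in one table makes it easier to see whether the slope and p-value move between the start, middle, and end of the time frame.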

## 
## Call:
## lm(formula = COVID_CASE_RATE ~ obs_mask, data = mask0713)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1807.1  -749.6   106.8   717.7  1897.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     5011       2197   2.281   0.0435 *
## obs_mask       -2973       2747  -1.082   0.3024  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1073 on 11 degrees of freedom
## Multiple R-squared:  0.0962, Adjusted R-squared:  0.01404 
## F-statistic: 1.171 on 1 and 11 DF,  p-value: 0.3024
## 
## Call:
## lm(formula = COVID_CASE_RATE ~ obs_mask, data = mask0727)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1825.3  -766.9   105.7   730.8  1926.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     5046       2221   2.272   0.0442 *
## obs_mask       -2957       2778  -1.065   0.3099  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1085 on 11 degrees of freedom
## Multiple R-squared:  0.0934, Adjusted R-squared:  0.01098 
## F-statistic: 1.133 on 1 and 11 DF,  p-value: 0.3099
## 
## Call:
## lm(formula = COVID_CASE_RATE ~ obs_mask, data = mask0810)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1860.8  -780.0   109.6   745.1  1939.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     5146       2257   2.280   0.0436 *
## obs_mask       -3013       2823  -1.068   0.3086  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1102 on 11 degrees of freedom
## Multiple R-squared:  0.09388,    Adjusted R-squared:  0.0115 
## F-statistic:  1.14 on 1 and 11 DF,  p-value: 0.3086

Looking at the models, we are unable to reject the null hypothesis, since p is approximately .30 for each model, well above .05. Based on this, I cannot say that observed mask usage has a relationship with the COVID 19 case rate.

When looking at the diagnostic plots, we can find that there is not a clear linear relationship in these models, and that there is a heteroscedasticity problem with the data.

Additional Analysis

Finally, I decided to see if population was a statistically significant predictor of case rates. Since POP_DENOMINATOR is the known population of an area at the start of the pandemic, all the data needed was in the NYC Health dataset. For each ZCTA, I sampled the case rate and population from one randomly selected date. For example, the data for 10001 could be from 7/13, 7/20, 7/27, 8/3, or 8/10. I created a linear regression model from this dataset and plotted it below:

Population and Case Rates

## 
## Call:
## lm(formula = COVID_CASE_RATE ~ POP_DENOMINATOR, data = nyc.covid19.data_sample)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1638.22  -729.09   -17.91   664.11  2280.89 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.198e+03  1.387e+02  15.852   <2e-16 ***
## POP_DENOMINATOR 5.635e-03  2.556e-03   2.204   0.0288 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 895.6 on 175 degrees of freedom
## Multiple R-squared:  0.02702,    Adjusted R-squared:  0.02146 
## F-statistic:  4.86 on 1 and 175 DF,  p-value: 0.0288

When looking at this model, we can reject the null hypothesis, as p = .029 is below .05. Looking closer, however, the R^2 value is about .03, which means that the model explains very little of the variation in the data.
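The sampling step described in this section, picking one random snapshot per ZCTA before fitting, can be sketched as below. The data frame name and column names follow the model call above, while the `zcta` and `date` columns, the 50 toy areas, the `pick_one` helper, and all values are assumptions for illustration.

```r
set.seed(4)

dates <- c("07-13", "07-20", "07-27", "08-03", "08-10")
zctas <- sprintf("%05d", 1:50)
pop   <- round(runif(50, min = 5000, max = 110000))

# Randomly generated long table: one row per ZCTA per date (not real data).
long <- expand.grid(zcta = zctas, date = dates, stringsAsFactors = FALSE)
long$POP_DENOMINATOR <- pop[match(long$zcta, zctas)]
long$COVID_CASE_RATE <- rnorm(nrow(long), mean = 2400, sd = 900)

# Hypothetical helper: keep one randomly chosen date's row for a ZCTA.
pick_one <- function(rows) rows[sample(nrow(rows), 1), ]
nyc.covid19.data_sample <- do.call(rbind, lapply(split(long, long$zcta), pick_one))

fit <- lm(COVID_CASE_RATE ~ POP_DENOMINATOR, data = nyc.covid19.data_sample)
s   <- summary(fit)

# Significance and explanatory power are separate questions.
p_value   <- s$coefficients["POP_DENOMINATOR", "Pr(>|t|)"]
r_squared <- s$r.squared

cat("p-value:", round(p_value, 4), "R-squared:", round(r_squared, 4), "\n")
```

Reading the p-value and R-squared together is what separates "population has a detectable association with case rate" from "population explains the case rate".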

Model Info

Reflection

With this project, my main goal was to build my data storytelling skills with interactive maps and data visualisations. The initial focus and effort around learning how to create an interactive map is why the unified.plot section is its own R file.

My first challenge was learning how to connect data from NYC Health with spatial data so I could create an interactive map. At this point, I realised I would also need to find a source of spatial data. Since the NYC Health data used ZCTAs, I decided it would be best to use spatial data based on ZCTAs. Fortunately for me, the US Census has spatial data based on ZCTAs, and the tigris package could both retrieve the spatial data and merge it by ZCTA. Once this was taken care of, it was straightforward both to merge the spatial data with the data from the NY Times article and to use Leaflet to create an interactive map.

My second goal was to continue creating interactive data visualisations. I continued to use Plotly and focused more on adding additional layers (when relevant) and ensuring that the interactivity of the charts would add value and information for the reader. I originally did not plan to have regression analysis in this report. However, as I continued to work on it, I became interested in finding out if mask usage or population count could be a predictor of COVID 19 rates from a statistical perspective. This gave me an opportunity to conduct regression analyses and to create interactive charts of said analyses.

I really enjoyed working on this project and I felt I took a lot away in terms of skills learned and insight from the analysis. For future projects, I think I would like to focus on story telling through interactive data visuals alone as a challenge to build more data viz skills. Additionally, I think I would like to limit the scope and prepare a more detailed plan so I could finish it in a more timely manner.